Overview

My project will be a limited meta-analysis of existing bulk RNAseq data sets detailing the differentiation of induced-pluripotent stem cells (IPSCs) into microglia, a key population of resident immune cells in the central nervous system (CNS). IPSC-based modeling is the future of microglia biology in vitro, thus there are many competing standards - eg different IPSC-derived microglia produced by different groups. My analysis will compare the transcriptomes of these competing microglia to a “Gold Standard” or reference transcriptomic dataset generated using microglia isolated from post-mortem biopsies from human donors.

Project Repository: https://github.com/gesualdiJ/BMIN503_Final_Project

Introduction

Microglia have long been challenging to model in-vitro using conventional approaches such as immortalized cell lines or primary culture because their morphology, function, and (more recently appreciated) transcriptome rapidly deteriorate when cultured in isolation. To address this issue, numerous groups have developed microglial cell lines derived from IPSCs through supplementation of cultures with various factors normally present in the micro-encironment of the CNS. The goal of this project is to interrogte the merit of several published IPSC-derived microglia lines along with some more recently developed applications of these cells involving chimeric animal model systems.

Addressing this problem involves suject matter knowledge of both immunology and neuroscience, as well as various bioinformatic techniques. This analysis will include only a subset of published microglial lines, prioritizing those models that myself and Drs Schaletzki and Su considered influential. The reference transcriptomes - one from adult donors and one generated from analysis of pediatric tissue - add rigor to this project because they represent the closest that we can realistically hope to get to the in vivo gene expression of microglia. The microglia that were analyzed to generate these reference transcriptomes were isolated immediately after surgical resection from patients, therefore there has been minimal time for these cells to lose their in vivo character. Finally, this analysis will be useful to microglia biologists because while many transcriptomic data sets exist, there have been few if any comprehensive meta-analyses of these data and still fewer attempts to harmonize their insights. This project aims to make progress in that direction.

Methods

The general overview of the methods for this project is as follows:

More detailed descriptions are at the tops of code blocks 3 - 9 as well as interspersed throughout the code sections via comments.

#install and load packages—-

install.packages("BiocManager")
BiocManager::install(c("GEOquery", 
                       "oligo", 
                       "limma", 
                       "hgu133plus2.db", 
                       "pd.hg.u133.plus.2", 
                       "viridis", 
                       "fgsea", 
                       "edgeR",
                       "gplots"))
install.packages("tidyverse")
install.packages("reshape2")
install.packages("edgeR")
install.packages("readxl")
install.packages("ggplot2")
install.packages("fgsea")
install.packages("stringr")
install.packages("readr")
install.packages("purrr")
install.packages("plotly")
install.packages("gt")
install.packages("magrittr")
install.packages("geosphere")
BiocManager::install("ensembldb") 
BiocManager::install("AnnotationHub") 
BiocManager::install("rhdf5") 
BiocManager::install("mixOmics") 
BiocManager::install('EnhancedVolcano')
BiocManager::install("factoextra")
BiocManager::install("GSVA")
BiocManager::install("gprofiler2")
BiocManager::install("GSEABase")
BiocManager::install("clusterProfiler")
BiocManager::install("enrichplot")
install.packages("DT")
library(BiocManager)
library(tidyverse)
library(reshape2)
library(rhdf5)
library(edgeR)
library(readxl)
library(GEOquery)
library(oligo)
library(limma)
library(viridis)
library(pd.hg.u133.plus.2)
library(hgu133plus2.db)
library(ggplot2)
library(gplots)
library(fgsea)
library(stringr)
library(readr)
library(ensembldb)
library(AnnotationHub)
library(purrr)
library(plotly)
library(mixOmics)
library(gt)
library(EnhancedVolcano)
library(magrittr)
library(factoextra)
library(GSVA)
library(gprofiler2)
library(GSEABase)
library(clusterProfiler)
library(enrichplot)
library(geosphere)
library(DT)

The Gosselin reference dataset below will contain 24 samples, all Ex Vivo microglia samples isolated following surgical resections from pediatric patients. The number of observations (Expressed genes) for this dataset is 14,528. These data are published and available on GEO but here I will read in the raw counts, sample labels, and gene names directly from a .xlsx file that was provided to me by a colleague who worked on this study. Therefore, we can read the counts into a dataframe and begin cleaning, filtering, and normalizing the data without having to access any transcriptomic repositories or databases as we will do for the later datasets. This makes the processing for this data set much simpler but the code below is less generalizeable as a result. The following datasets will make use of either EnsembleDB or ArchS4 and are therefore more informative for irectly accessing published data without any personal connections

#Gosselin Data, Fetal Ex-Vivo Microglia (Reference) ----

#The processed data frames containing these data are stored in tbl's 
#"ExVivo_Juvenile_MG_Reference_Gosselin" (no GeneID column, gene names are row names) 
#or "Gosselin_log2_cpm_filtered_norm" (maintains GeneID as a separate column)

#Read in raw counts
microglia_fetal_ExVivo_Gosselin <- read_xlsx("Human_RNA-seq_data_Gosselin_Fetal_ExVivo.xlsx", sheet = 1, skip = 1) 
microglia_fetal_ExVivo_Gosselin <- dplyr::rename(microglia_fetal_ExVivo_Gosselin, Gene_ID = ...1)

#remove any duplicated genes symbols (2) and ove gene labels from column to rowname
microglia_fetal_ExVivo_Gosselin <- microglia_fetal_ExVivo_Gosselin[!duplicated(microglia_fetal_ExVivo_Gosselin$Gene_ID), ]  
microglia_fetal_ExVivo_Gosselin <- column_to_rownames(microglia_fetal_ExVivo_Gosselin, var = "Gene_ID") 

#Removes 40 columns of unneeded monocyte/in vitro/lysate data
microglia_fetal_ExVivo_Gosselin <- microglia_fetal_ExVivo_Gosselin %>%    
  dplyr::select(!contains("InVitro") & !contains("Monocytes") & !contains("Lysate"))  
sample_Labels_Gosselin <- colnames(microglia_fetal_ExVivo_Gosselin)

#Convert to DGE list to facilitate normalization and count visualization
#Then get counts per million (cpm) to visualize unexpressed / lowly expressed probes
DGEList_Gosselin <- DGEList(microglia_fetal_ExVivo_Gosselin) 
Gosselin_cpm <- cpm(DGEList_Gosselin$counts)                

#Table shows number of non-expressed genes across all samples, indicates roughly how many should be filtered out
table(rowSums(DGEList_Gosselin$counts==0)==24)               
## 
## FALSE  TRUE 
## 22061  2750
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
#reshape2::melt() is a method that transforms tbl.df's from a wide or user friendly format
# to a 'tall' or processing friendly format. It reshapes the data frame to have three columns:
#GeneID (gene name), Variable (the sampel name), and Value (the raw read count for that gene by that sample)
#this reshaping allows us to easily visualize the entire distribution of counts for each sample in the ggplot below
Gosselin_log2_cpm <- as_tibble(cpm(DGEList_Gosselin$counts, log = TRUE) , rownames = "GeneID") 
Gosselin_log2_cpm_melt <- melt(Gosselin_log2_cpm)
head(Gosselin_log2_cpm_melt)
##     GeneID                         variable      value
## 1     A1BG RNA_HMG002_S002_Microglia_ExVivo -0.7055321
## 2 A1BG-AS1 RNA_HMG002_S002_Microglia_ExVivo  0.2073467
## 3     A1CF RNA_HMG002_S002_Microglia_ExVivo -2.9884547
## 4      A2M RNA_HMG002_S002_Microglia_ExVivo 13.0833842
## 5  A2M-AS1 RNA_HMG002_S002_Microglia_ExVivo  7.2165096
## 6    A2ML1 RNA_HMG002_S002_Microglia_ExVivo -1.5456431
#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Gosselin_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +     
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Gosselin, unfiltered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#create logical vector that evaluates to TRUE when at least 6 of 24 samples express a given gene 
filtrate_gosselin <- rowSums(Gosselin_cpm > 1) >= 6  
table(filtrate_gosselin)
## filtrate_gosselin
## FALSE  TRUE 
## 10283 14528
#subset DGE list so that only genes expressed in at least 6 samples remain
#the get counts per million / reshape to re-visualize distribution of expression without unexpressed probes
DGEList_Gosselin_filtered <- DGEList_Gosselin[filtrate_gosselin, ] 
Gosselin_log2_cpm_filtered <- as_tibble(cpm(DGEList_Gosselin_filtered, log = TRUE), rownames = "GeneID") 
Gosselin_log2_cpm_filtered_melt <- melt(Gosselin_log2_cpm_filtered)
ggplot(Gosselin_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +     
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Gosselin, filtered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#TPM normalization of filtered DGE list, then store data in tbl.df's
DGEList_Gosselin_filtered_norm <- calcNormFactors(DGEList_Gosselin_filtered, method = "TMM") 
Gosselin_log2_cpm_filtered_norm <- as_tibble(cpm(DGEList_Gosselin_filtered_norm, log = TRUE), rownames = "GeneID") 
ExVivo_Juvenile_MG_Reference_Gosselin <- column_to_rownames(Gosselin_log2_cpm_filtered_norm, var = "GeneID") 

#clean up lengthy sample labels for this dataset. haters will suggest rewriting with vapply. I will never learn vapply!
labels_Gosselin <- rep(NA, 24)  
i <- 1
while (i <= 24){
  labels_Gosselin[i] = paste0(paste0("J", substring(sample_Labels_Gosselin[i], 16, 32)), as.character(i))
  i = i + 1
}
colnames(ExVivo_Juvenile_MG_Reference_Gosselin) <- labels_Gosselin
colnames(Gosselin_log2_cpm_filtered_norm)[2:ncol(Gosselin_log2_cpm_filtered_norm)] <- labels_Gosselin

The plots generated above, as well as the subsequent ones for future datasets, show that a high density of probes are either very lowly expressed or not expressed at all across all samples in the dataset. This is typical as bulk sequencing reads an extremely high number of probes. It is standard procedure to filter these unexpressed probes out, leading to the more uniform distribution observed in the second violin plot.

The Eggen reference dataset below will contain 39 total samples, all Ex Vivo microglia surgically resected from adult patients. The number of observations (Expressed genes) for this dataset is 14,397. This dataset was accessed by directly downloading the .tsv file containing the raw counts from Gene Expression Omnibus (GEO). However, these data contained only transcript IDs instead of gene names like all the other datasets. Therefore, I use AnnotationHub() to access the Ensemble Data Base in order to map transcript IDs to the appropriate gene names. This is comparable to the use of hgu133plus2.db in Practicum 9, but is optimized for use with more contemporary bulk RNA sequencing analysis rather than microarray data. This is an alternative approach to the use of the ArchS4 database, which I employ for the following 3 datasets. This Emsemble Data Base method was included chiefly to showcase alternative approaches to access GEP data without downloading and reading in the large (16 GB) ArchS4 database. Therefore, this approach is more beginner friendly and allows the user to avoid working with the somewhat cumbersome HDF5 files that ArchS4 uses to organize data.

#Eggen Data, Adult Ex-Vivo Microglia (Reference)----

#The processed data frames containing these data are stored in tbl's 
#"ExVivo_Adult_MG_Reference_Eggen" (no GeneID column, gene names are row names) 
#or "Eggen_log2_cpm_filtered_norm" (maintains GeneID as a separate column)

ah <- AnnotationHub()
ahDb <- query(ah, pattern = c("Homo Sapiens", "EnsDb"))  

#Query for available H.Sapiens EnsDb databases, using for mapping probe id's to gene names
EndDB_Hs <- ahDb[["AH104864"]]
EnsDB_Hs <- EndDB_Hs  
EnsDB_Hs_DF <- genes(EnsDB_Hs, return.type = "data.frame") %>%
  dplyr::select(gene_id, gene_name)
EnsDB_Hs_DF <- dplyr::rename(EnsDB_Hs_DF, Tx_ID = gene_id)

#read in data, will need to add gene names using EnsDB information
microglia_adult_ExVivo_Eggen <- read_tsv("GSE99074_HumanMicrogliaBrainCounts.txt")       
microglia_adult_ExVivo_Eggen <- dplyr::rename(microglia_adult_ExVivo_Eggen, Tx_ID = ...1)

#subset list of matched tx_id's and gene names to only include tx's present in dataset
EnsDB_Hs_DF <- dplyr::filter(EnsDB_Hs_DF, EnsDB_Hs_DF$Tx_ID %in% microglia_adult_ExVivo_Eggen$Tx_ID)  
microglia_adult_ExVivo_Eggen <- inner_join(microglia_adult_ExVivo_Eggen, EnsDB_Hs_DF, by = "Tx_ID")
microglia_adult_ExVivo_Eggen <- relocate(microglia_adult_ExVivo_Eggen, gene_name, .before = spm09)
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[ , 1:67]
microglia_adult_ExVivo_Eggen <- dplyr::rename(microglia_adult_ExVivo_Eggen, GeneID = gene_name)

#5116 transcripts ID's for which there is no corresponding gene name, we remove using dplyr::filter
table(microglia_adult_ExVivo_Eggen$GeneID == "") 
## 
## FALSE  TRUE 
## 22721  5116
microglia_adult_ExVivo_Eggen <- dplyr::filter(microglia_adult_ExVivo_Eggen, GeneID != "")
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[ , 2:67]
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[1:40]
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[!duplicated(microglia_adult_ExVivo_Eggen$GeneID), ]
microglia_adult_ExVivo_Eggen <- column_to_rownames(microglia_adult_ExVivo_Eggen, var = "GeneID")

#Create DGElist, find counts per million, visualize distribution of data, and filter out unexpressed probes
DGEList_Eggen <- DGEList(microglia_adult_ExVivo_Eggen)      
Eggen_cpm <- cpm(DGEList_Eggen)                
table(rowSums(DGEList_Eggen$counts==0)==20)     
## 
## FALSE  TRUE 
## 22270   237
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Eggen_log2_cpm <- as_tibble(cpm(DGEList_Eggen, log = TRUE) , rownames = "GeneID") 
Eggen_log2_cpm_melt <- melt(Eggen_log2_cpm)

#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Eggen_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +     
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Eggen, unfiltered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#create logical vector that evaluates to TRUE when at least 20 of 39 samples express a given gene 
filtrate_eggen <- rowSums(Eggen_cpm > 1) >= 20  
table(filtrate_eggen)
## filtrate_eggen
## FALSE  TRUE 
##  8110 14397
#subset DGE list so that only genes expressed in at least 24 samples remain, remove 8110 unexpressed genes
#then reformat to log2 counts per million to reshape and revisualize distribution of data
DGEList_Eggen_filtered <- DGEList_Eggen[filtrate_eggen, ]
Eggen_log2_cpm_filtered <- as_tibble(cpm(DGEList_Eggen_filtered, log = TRUE), rownames = "GeneID") 

Eggen_log2_cpm_filtered_melt <- melt(Eggen_log2_cpm_filtered)
ggplot(Eggen_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +     
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Eggen, filtered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#TPM normalization, store proccessed data as tbl.df
DGEList_Eggen_filtered_norm <- calcNormFactors(DGEList_Eggen_filtered, method = "TMM") 
Eggen_log2_cpm_filtered_norm <- as_tibble(cpm(DGEList_Eggen_filtered_norm, log = TRUE), rownames = "GeneID") 
ExVivo_Adult_MG_Reference_Eggen <- column_to_rownames(Eggen_log2_cpm_filtered_norm, var = "GeneID") 

#Clean up non-intuitive sample labels. 
labels_Eggen <- rep(NA, ncol(ExVivo_Adult_MG_Reference_Eggen))  
j <- 1

while (j <= length(labels_Eggen)){
  labels_Eggen[j] = paste0("A_Microllia_ExVivo", as.character(j))
  j = j + 1
}
colnames(ExVivo_Adult_MG_Reference_Eggen) <- labels_Eggen

#Format maintaining GeneID column, needed for join function
Eggen_log2_cpm_filtered_norm <- rownames_to_column(ExVivo_Adult_MG_Reference_Eggen, var = "GeneID")
#KJS Lab internal data, IPSC-Microglia and Monocyte Derived Macrophages, Pre-Processed ----

#This data set was previously prepared for my thesis lab. 
#Contains 3 IPSC-Microglia samples and 3 Monocyte-derived Macrophage (MDM) samples
#The total number of observations (expressed genes) for this dataset is 14,139
#Only handling here is adding the prefix "KJS_" to the start of the existing 
#sample labels to facilitate downstream aggregation / analysis
#Tbl "KJS_iMG_MDM" contains this dataset and maintains a GeneID column

KJS_iMG_MDM <- read_xlsx("KJS_iMG_MDM_TMM_Normalized.xlsx")
labels_KJS <- c(rep(NA, ncol(KJS_iMG_MDM)))
labels_KJS[1] <- "GeneID"
for(n in 2: ncol(KJS_iMG_MDM)){
  labels_KJS[n] <- paste0("KJS_", colnames(KJS_iMG_MDM[n]))
}
colnames(KJS_iMG_MDM) <- labels_KJS

The Abud dataset below will contain 15 total samples, 9 in vitro differentiated IPSC microglia cultured independently and 6 cultured with rat hippocampal neurons (labeled rHcN). The total number of observations (expressed genes) in this dataset is 12,968. This dataset as well as the two following after will be accessed and cleaned using the Archs4 database, which contains all of the samples ever uploaded to the Gene Expression Omnibus (GEO) Repository. It includes - among other information - raw counts for samples as well as descriptions of experimental groupings and overall study design. This is the most efficient way to access and clean data from GEO and saves you the trouble of fishing for methodological details in the text of the paper.

#Abud 2017 data (IPSC-Microglia +/- coculture with Rat Hippocampal neurons)----

#The processed data frames containing these data are stored in tbl's "IPSC_Microglia_Abud" 
#(no GeneID column, gene names are row names) 
#or "DGEList_Abud_filtered_norm" (maintains GeneID as a separate column)

#bring in downloaded ArchS4 database containing all GEO accessible transcriptomic data
archs4.human <- "/Users/jamesgesualdi/Desktop/BMIN_5030/BMIN503_Final_Project_Local/human_matrix_v11.h5"  
h5ls(archs4.human)  #List the contents of the loaded Database
##            group                  name       otype  dclass            dim
## 0              /                  data   H5I_GROUP                       
## 1          /data            expression H5I_DATASET INTEGER 441356 x 35238
## 2              /                  meta   H5I_GROUP                       
## 3          /meta                 genes   H5I_GROUP                       
## 4    /meta/genes            chromosome H5I_DATASET  STRING          35238
## 5    /meta/genes       ensembl_gene_id H5I_DATASET  STRING          35238
## 6    /meta/genes          gene_biotype H5I_DATASET  STRING          35238
## 7    /meta/genes           gene_symbol H5I_DATASET  STRING          35238
## 8    /meta/genes                 genes H5I_DATASET  STRING          35238
## 9    /meta/genes        start_position H5I_DATASET  STRING          35238
## 10         /meta                  info   H5I_GROUP                       
## 11    /meta/info                author H5I_DATASET  STRING          ( 0 )
## 12    /meta/info               contact H5I_DATASET  STRING          ( 0 )
## 13    /meta/info         creation-date H5I_DATASET  STRING          ( 0 )
## 14    /meta/info            laboratory H5I_DATASET  STRING          ( 0 )
## 15    /meta/info               version H5I_DATASET INTEGER          ( 0 )
## 16         /meta               samples   H5I_GROUP                       
## 17 /meta/samples         channel_count H5I_DATASET  STRING         441356
## 18 /meta/samples   characteristics_ch1 H5I_DATASET  STRING         441356
## 19 /meta/samples       contact_address H5I_DATASET  STRING         441356
## 20 /meta/samples          contact_city H5I_DATASET  STRING         441356
## 21 /meta/samples       contact_country H5I_DATASET  STRING         441356
## 22 /meta/samples     contact_institute H5I_DATASET  STRING         441356
## 23 /meta/samples          contact_name H5I_DATASET  STRING         441356
## 24 /meta/samples           contact_zip H5I_DATASET  STRING         441356
## 25 /meta/samples       data_processing H5I_DATASET  STRING         441356
## 26 /meta/samples  extract_protocol_ch1 H5I_DATASET  STRING         441356
## 27 /meta/samples         geo_accession H5I_DATASET  STRING         441356
## 28 /meta/samples      instrument_model H5I_DATASET  STRING         441356
## 29 /meta/samples      last_update_date H5I_DATASET  STRING         441356
## 30 /meta/samples     library_selection H5I_DATASET  STRING         441356
## 31 /meta/samples        library_source H5I_DATASET  STRING         441356
## 32 /meta/samples      library_strategy H5I_DATASET  STRING         441356
## 33 /meta/samples          molecule_ch1 H5I_DATASET  STRING         441356
## 34 /meta/samples          organism_ch1 H5I_DATASET  STRING         441356
## 35 /meta/samples           platform_id H5I_DATASET  STRING         441356
## 36 /meta/samples          readsaligned H5I_DATASET INTEGER         441356
## 37 /meta/samples            readstotal H5I_DATASET INTEGER         441356
## 38 /meta/samples              relation H5I_DATASET  STRING         441356
## 39 /meta/samples             series_id H5I_DATASET  STRING         441356
## 40 /meta/samples singlecellprobability H5I_DATASET   FLOAT         441356
## 41 /meta/samples       source_name_ch1 H5I_DATASET  STRING         441356
## 42 /meta/samples                status H5I_DATASET  STRING         441356
## 43 /meta/samples       submission_date H5I_DATASET  STRING         441356
## 44 /meta/samples             taxid_ch1 H5I_DATASET  STRING         441356
## 45 /meta/samples                 title H5I_DATASET  STRING         441356
## 46 /meta/samples                  type H5I_DATASET  STRING         441356
#Read in all 441,356 GEO Accession numbers (eg all total human samples on GEO)
all.samples.human <- h5read(archs4.human, name="meta/samples/geo_accession")    

#Query the ArchS4 database for the samples I need from the Abud study, based on descriptions in the paper and on the GEO Entry
Abud_Samples <- c("GSM2360252", #10318X2
                  "GSM2360253", #7028X2
                  "GSM2360254", #x2-1
                  "GSM2360255", #x2-2
                  "GSM2360256", #x2-3
                  "GSM2360257", #x2-4
                  "GSM2360283", #HC-1
                  "GSM2360284", #HC-2
                  "GSM2360285", #HC-3
                  "GSM2360286", #HC-4
                  "GSM2360287", #HC-5
                  "GSM2360288", #HC-6
                  "GSM2445478", #n200 -2
                  "GSM2445479", #n200 -3
                  "GSM2445480"  #n200 -4
)   

#Find the indices of the above samples using their accession numbers
Abud_Sample_Locations <- which(all.samples.human %in% Abud_Samples)     
genes <- h5read(archs4.human, "meta/genes/genes")   #Bring in gene labels provided by ArchS4

#Read in raw counts for my samples of interest
Expression_Abud <- t(h5read(archs4.human, "data/expression",                  
                     index=list(Abud_Sample_Locations, 1:length(genes))))   
H5close()
rownames(Expression_Abud) <- genes
colnames(Expression_Abud) <- all.samples.human[Abud_Sample_Locations]

#Below, I access some meta-data from the study to verify that the appropriate 
#data was extracted from ArchS4, and clean up colnames to more informative labels
Sample_source_name_ch1 <- h5read(archs4.human, "meta/samples/source_name_ch1")     
Sample_title <- h5read(archs4.human, name= "meta/samples/title")                   
Sample_characteristics <- h5read(archs4.human, name="meta/samples/characteristics_ch1")   

studyDesign_Abud <- tibble(Sample_title_Abud = Sample_title[Abud_Sample_Locations], 
                      Sample_source_name_ch1 = Sample_source_name_ch1[Abud_Sample_Locations],
                      Sample_characteristics = Sample_characteristics[Abud_Sample_Locations])

head(studyDesign_Abud)                                #Just for context
## # A tibble: 6 × 3
##   Sample_title_Abud Sample_source_name_ch1 Sample_characteristics               
##   <chr[1d]>         <chr[1d]>              <chr[1d]>                            
## 1 x2-4              iPS microglia          "stimulus/treatment: None\ttime (hou…
## 2 HC-5              iPS microglia          "stimulus/treatment: Rat hippocampal…
## 3 10318X2           iPS microglia          "stimulus/treatment: None\ttime (hou…
## 4 7028X2            iPS microglia          "stimulus/treatment: None\ttime (hou…
## 5 x2-1              iPS microglia          "stimulus/treatment: None\ttime (hou…
## 6 n200 -4           human iPS microglia    "stimulus/treatment: None\ttime (hou…
colnames(Expression_Abud) <- studyDesign_Abud$Sample_title_Abud

#Create DGE list and get cpm values for filtering / normalization
DGEList_Abud <- DGEList(Expression_Abud)     
Abud_cpm <- cpm(DGEList_Abud)                
table(rowSums(DGEList_Abud$counts==0)==15)              
## 
## FALSE  TRUE 
## 26664  8574
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Abud_log2_cpm <- as_tibble(cpm(DGEList_Abud, log = TRUE) , rownames = "GeneID") 
Abud_log2_cpm_melt <- melt(Abud_log2_cpm)

#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Abud_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +     
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Abud, unfiltered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#create logical vector that evaluates to TRUE when at least 4 of 15 samples express a given gene
filtrate_abud <- rowSums(Abud_cpm > 1) >= 4   
table(filtrate_abud)
## filtrate_abud
## FALSE  TRUE 
## 22270 12968
#subset DGE list so that only genes expressed in at least 3 samples remain, remove 24,138 unexpressed genes
#Redo cpm / log transformation and reshape / revisualize data
DGEList_Abud_filtered <- DGEList_Abud[filtrate_abud, ] 
Abud_log2_cpm_filtered <- as_tibble(cpm(DGEList_Abud_filtered, log = TRUE), rownames = "GeneID") 

Abud_log2_cpm_filtered_melt <- melt(Abud_log2_cpm_filtered)     
ggplot(Abud_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +   
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +   
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Abud, filtered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#TPM normalization
DGEList_Abud_filtered_norm <- calcNormFactors(DGEList_Abud_filtered, method = "TMM") 
DGEList_Abud_filtered_norm <- as_tibble(cpm(DGEList_Abud_filtered_norm, log = TRUE), rownames = "GeneID") 
IPSC_Microglia_Abud <- column_to_rownames(DGEList_Abud_filtered_norm, var = "GeneID")                 
IPSC_Microglia_Abud <- IPSC_Microglia_Abud  %>%
  dplyr::select(order(colnames(IPSC_Microglia_Abud)))

#clean up non-intuitive sample labels
labels_Abud <- c(rep("IPSC_Microglia_Abud", 2), rep("IPSC_Microglia_Abud_rHcN", 6), rep("IPSC_Microglia_Abud", 7))  
labels_Abud_Numbers <- c("1", "2", "1", "2", "3", "4", "5", "6", "3", "4", "5", "6", "7", "8", "9")
for(l in 1:length(labels_Abud)){
  labels_Abud[l] <- paste0(labels_Abud[l], labels_Abud_Numbers[l])
}
colnames(IPSC_Microglia_Abud) <- labels_Abud

#reorder again for convenient grouping of now renamed samples,
#Add inproved labels to DGEList_Abud_filtered_norm
IPSC_Microglia_Abud <- dplyr::select(IPSC_Microglia_Abud, order(colnames(IPSC_Microglia_Abud)))
DGEList_Abud_filtered_norm <- rownames_to_column(IPSC_Microglia_Abud, var = "GeneID")

The Blurton-Jones dataset below will contain 16 total samples, 10 xenotransplanted microglia and 6 in vitro differentiated IPSC microglia. The total number of observations (expressed genes) in this dataset is 13,421.

#Blurton jones 2019 data (IPSC-Microglia +/- Transplant ("Xenograft") into chimeric mice) ----

#This processed dataset will contain 16 total samples (10 transplanted iMg, 
#6 in vitro. Transplanted iMg are notated 'xMG')
#The number of observations (expressed genes) for this dataset is 13,421
#The processed data frames containing these data are stored in tbl's 
#"IPSC_Microglia_Blurton" (no GeneID column, gene names are row names) 
#or "DGEList_Blurton_filtered_norm" (maintains GeneID as a separate column)

Blurton_Samples <- c("GSM3908536",  #MBJ_iMGL_SAL_1
  "GSM3908537", #MBJ_iMGL_SAL_2
  "GSM3908538", #MBJ_iMGL_SAL_3
  "GSM3908539", #MBJ_iMGL_LPS_1
  "GSM3908540", #MBJ_iMGL_LPS_2
  "GSM3908541", #MBJ_iMGL_LPS_3
  "GSM3908542", #MBJ_xMG_GFP_1
  "GSM3908543", #MBJ_xMG_GFP_2
  "GSM3908544", #MBJ_xMG_GFP_3
  "GSM3908545", #MBJ_xMG_GFP_LPS_1
  "GSM3908546", #MBJ_xMG_GFP_LPS_2
  "GSM3908547", #MBJ_xMG_GFP_LPS_3
  "GSM3908548", #MBJ_xMG_GFP_LPS_4
  "GSM3908549", #MBJ_xMG_GFP_LPS_5
  "GSM3908550", #MBJ_xMG_CDI_1
  "GSM3908551", #MBJ_xMG_CDI_2
  "GSM3908552", #MBJ_xMG_CDI_3
  "GSM3908553", #MBJ_xMG_CDI_4
  "GSM3908554", #MBJ_xMG_CDI_5
  "GSM3908555", #MBJ_xMG_CDI_6
  "GSM3908556", #MBJ_xMG_10C_1
  "GSM3908557", #MBJ_xMG_10C_2
  "GSM3908558", #MBJ_xMG_10C_3
  "GSM3908559", #MBJ_xMG_10C_4
  "GSM3908560", #MBJ_ExVivo_1
  "GSM3908561", #MBJ_ExVivo_2
  "GSM3908562", #MBJ_iMGL_GFP_1
  "GSM3908563", #MBJ_iMGL_GFP_2
  "GSM3908564", #MBJ_iMGL_GFP_3
  "GSM3908565", #MBJ_iMGL_GFP_4
  "GSM3908566", #MBJ_iMGL_GFP_5
  "GSM3908567"  #MBJ_iMGL_GFP_6
)

Blurton_Sample_Locations <- which(all.samples.human %in% Blurton_Samples)
Expression_Blurton <- t(h5read(archs4.human, "data/expression", 
                               index=list(Blurton_Sample_Locations, 1:length(genes))))
H5close()
rownames(Expression_Blurton) <- genes
colnames(Expression_Blurton) <- all.samples.human[Blurton_Sample_Locations]

#Below, I access some meta-data from the study to verify that the appropriate data was 
#extracted from ArchS4, and clean up colnames to more informative labels
studyDesign_Blurton <- tibble(Sample_title_Blurton = Sample_title[Blurton_Sample_Locations], 
                              Sample_source_name_ch1_Blurton = Sample_source_name_ch1[Blurton_Sample_Locations],
                              Sample_characteristics_Blurton = Sample_characteristics[Blurton_Sample_Locations])
head(studyDesign_Blurton)
## # A tibble: 6 × 3
##   Sample_title_Blurton Sample_source_name_ch1_Blurton Sample_characteristics_B…¹
##   <chr[1d]>            <chr[1d]>                      <chr[1d]>                 
## 1 MBJ_iMGL_SAL_1       Microglia                      "cell line: AICS-0036 (Co…
## 2 MBJ_iMGL_SAL_2       Microglia                      "cell line: AICS-0036 (Co…
## 3 MBJ_iMGL_SAL_3       Microglia                      "cell line: AICS-0036 (Co…
## 4 MBJ_iMGL_LPS_1       Microglia                      "cell line: AICS-0036 (Co…
## 5 MBJ_iMGL_LPS_2       Microglia                      "cell line: AICS-0036 (Co…
## 6 MBJ_iMGL_LPS_3       Microglia                      "cell line: AICS-0036 (Co…
## # … with abbreviated variable name ¹​Sample_characteristics_Blurton
colnames(Expression_Blurton) <- studyDesign_Blurton$Sample_title_Blurton

#Remove samples unnecessary for this analysis based on sample characteristics accessed above
Expression_Blurton <- Expression_Blurton[ , c(15:24, 27:32)]              

#Create DGE list for filtering / normalizing Blurton Data
DGEList_Blurton <- DGEList(Expression_Blurton)      
Expression_Blurton_cpm <- cpm(DGEList_Blurton)                
table(rowSums(DGEList_Blurton$counts==0)==18)              
## 
## FALSE 
## 35238
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Blurton_log2_cpm <- as_tibble(cpm(DGEList_Blurton, log = TRUE) , rownames = "GeneID") 
Blurton_log2_cpm_melt <- melt(Blurton_log2_cpm)

#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Blurton_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +     
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +                    
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Blurton-Jones, unfiltered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#create logical vector that evaluates to TRUE when at least 6 of 16 samples express a given gene 
filtrate_Blurton<- rowSums(Expression_Blurton_cpm > 1) >= 6  
table(filtrate_Blurton)
## filtrate_Blurton
## FALSE  TRUE 
## 21817 13421
#subset DGE list so that only genes expressed in at least 6 samples remain, remove 21817 unexpressed genes
#Redo counts per million, log transformation, and reshaping to visualize the filtered dataset
DGEList_Blurton_filtered <- DGEList_Blurton[filtrate_Blurton, ] 
Blurton_log2_cpm_filtered <- as_tibble(cpm(DGEList_Blurton_filtered, log = TRUE), rownames = "GeneID") 
Blurton_log2_cpm_filtered_melt <- melt(Blurton_log2_cpm_filtered)

ggplot(Blurton_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +   
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) + 
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Blurton-Jones, filtered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#TPM normalization
DGEList_Blurton_filtered_norm <- calcNormFactors(DGEList_Blurton_filtered, method = "TMM") 
DGEList_Blurton_filtered_norm <- as_tibble(cpm(DGEList_Blurton_filtered_norm, log = TRUE), rownames = "GeneID") 
IPSC_Microglia_Blurton <- column_to_rownames(DGEList_Blurton_filtered_norm, var = "GeneID")                

The Svoboda dataset below will eventually contain 18 total samples: 6 microglia xenotransplanted for 10 days, 6 xenotransplanted for 60 days, and 6 in vitro differentiated IPSC microglia. The total number of observations (expressed genes) in this dataset is 13,433.

#Svoboda 2019 data, (IPSC-Microglia +/- Transplant ("Xenograft") into chimeric mice) ---- 

#The processed data frames containing these data are stored in tbl's 
#"IPSC_Microglia_Svoboda" (no GeneID column, gene names are row names) 
#or "DGEList_Svoboda_filtered_norm" (maintains GeneID as a separate column)

Svoboda_Samples <- c("GSM4133389", #iMP1
                    "GSM4133390",   #iMP2
                    "GSM4133391",   #iMP3
                    "GSM4133392",   #iMP4
                    "GSM4133393",   #iMP5
                    "GSM4133394",   #iMP6
                    "GSM4133395",   #In vivo iMG 10dpi rep 1
                    "GSM4133396",   #In vivo iMG 10dpi rep 2
                    "GSM4133397",   #In vivo iMG 10dpi rep 3
                    "GSM4133398",   #In vivo iMG 10dpi rep 4
                    "GSM4133399",   #In vivo iMG 10dpi rep 5
                    "GSM4133400",   #In vivo iMG 10dpi rep 6
                    "GSM4133401",   #In vivo iMG 60dpi rep 1
                    "GSM4133402",   #In vivo iMG 60dpi rep 2
                    "GSM4133403",   #In vivo iMG 60dpi rep 3
                    "GSM4133404",   #In vivo iMG 60dpi rep 4
                    "GSM4133405",   #In vivo iMG 60dpi rep 5
                    "GSM4133406",   #In vivo iMG 60dpi rep 6
                    "GSM4133407",   #In vitro iMG 10dpi rep 1
                    "GSM4133408",   #In vitro iMG 10dpi rep 2
                    "GSM4133409",   #In vitro iMG 10dpi rep 3
                    "GSM4133410",   #In vitro iMG 10dpi rep 4
                    "GSM4133411",   #In vitro iMG 10dpi rep 5
                    "GSM4133412",   #In vitro iMG 10dpi rep 6
                    "GSM4133413", #In vitro iMG 60dpi rep 1
                    "GSM4133414", #In vitro iMG 60dpi rep 2
                    "GSM4133415",   #In vitro iMG 60dpi rep 3
                    "GSM4133416",   #In vitro iMG 60dpi rep 4
                    "GSM4133417",   #In vitro iMG 60dpi rep 5
                    "GSM4133418",   #In vitro iMG 60dpi rep 6
                    "GSM4133419",   #In Vivo 2 [M543]
                    "GSM4133420",   #In Vivo 1 [M544]
                    "GSM4133421",   #In Vivo 4 [M1]
                    "GSM4133422",   #In Vivo 3 [M2]
                    "GSM4133423",   #In Vitro 1 [IV1]
                    "GSM4133424"    #In Vitro 2 [IV2]
)

Svoboda_Sample_Locations <- which(all.samples.human %in% Svoboda_Samples)
Expression_Svoboda <- t(h5read(archs4.human, "data/expression", 
                            index=list(Svoboda_Sample_Locations, 1:length(genes))))
H5close()
rownames(Expression_Svoboda) <- genes
colnames(Expression_Svoboda) <- all.samples.human[Svoboda_Sample_Locations]

#Use Meta-data to create study design file and facilitate cleaning
studyDesign_Svoboda <- tibble(Sample_title_Svoboda = Sample_title[Svoboda_Sample_Locations], 
                           Sample_source_name_ch1_Svoboda = Sample_source_name_ch1[Svoboda_Sample_Locations],
                           Sample_characteristics_Svoboda = Sample_characteristics[Svoboda_Sample_Locations])

colnames(Expression_Svoboda) <- studyDesign_Svoboda$Sample_title_Svoboda
Expression_Svoboda <- Expression_Svoboda[ , 9:26] #remove irrelevant samples

##Create DGE list for filtering and normalization
#perform cpm() and log transformation to visualize
#overall distribution of expression data
DGEList_Svoboda <- DGEList(Expression_Svoboda)      
Expression_Svoboda_cpm <- cpm(DGEList_Svoboda)                
table(rowSums(DGEList_Svoboda$counts==0)==18)              
## 
## FALSE  TRUE 
## 27100  8138
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Svoboda_log2_cpm <- as_tibble(cpm(DGEList_Svoboda, log = TRUE) , rownames = "GeneID") 
Svoboda_log2_cpm_melt <- melt(Svoboda_log2_cpm)

#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Svoboda_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +     
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +   
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Svoboda, unfiltered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#create logical vector that evaluates to TRUE when at least 6 of 18 samples express a given gene 
filtrate_Svoboda<- rowSums(Expression_Svoboda_cpm > 1) >= 6  
table(filtrate_Svoboda)
## filtrate_Svoboda
## FALSE  TRUE 
## 21805 13433
#subset DGE list so that only genes expressed in at least 6 samples remain. 
#This will 21805 unexpressed genes, I'll then violin plot again to visualize impact of filtering

DGEList_Svoboda_filtered <- DGEList_Svoboda[filtrate_Svoboda, ]
Svoboda_log2_cpm_filtered <- as_tibble(cpm(DGEList_Svoboda_filtered, log = TRUE), rownames = "GeneID") 
Svoboda_log2_cpm_filtered_melt <- melt(Svoboda_log2_cpm_filtered)

ggplot(Svoboda_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +   
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun.y = "median", 
               geom = "point", 
               shape = 124, 
               size = 6, 
               color = "black", 
               show.legend = FALSE) +
  theme(axis.text.x=element_blank()) +   
  labs(y="log2 expression",
       title="Log2 Counts per Million (CPM)",
       subtitle="Svoboda, filtered, non-normalized",
       caption=paste0("produced on ", Sys.time()))

#TPM normalization
DGEList_Svoboda_filtered_norm <- calcNormFactors(DGEList_Svoboda_filtered, method = "TMM") 
DGEList_Svoboda_filtered_norm <- as_tibble(cpm(DGEList_Svoboda_filtered_norm, log = TRUE), rownames = "GeneID") 
IPSC_Microglia_Svoboda <- column_to_rownames(DGEList_Svoboda_filtered_norm, var = "GeneID")                  

#Add "Svoboda_" to sample lables to facilitate later analysis and visualization. 
labels_Svoboda <- c(rep(NA, ncol(IPSC_Microglia_Svoboda)))     
for(m in 1:length(labels_Svoboda)){
  labels_Svoboda[m] <- paste0("Svoboda_", colnames(IPSC_Microglia_Svoboda[m]))
}
colnames(IPSC_Microglia_Svoboda) <- labels_Svoboda
colnames(DGEList_Svoboda_filtered_norm[2:ncol(DGEList_Svoboda_filtered_norm)]) <- labels_Svoboda

Now that all the datasets have been appropriately processed, I am going to join them together. We will accomplish this by adding each dataset to a list and then using purr::reduce() to apply left_join() to all elements of the list. This will create a tibble of all 118 samples with a total of 14,528 observations (genes). The number of observations comes from the Gosselin reference dataset, which had the largets number of samples (14,528).

This joining approach does introduce NA data into the aggregate tibble, since there are some genes that are not shared among all 6 constituent datasets. Removing any genes for which any samples ended up with an NA value leads to a removal of 4269 genes, leading to a final number of observations of 10,259 in the aggregate dataset. This loss of information is a key limitation of this meta-analysis approach and would need to be improved upon substantially for a publication-quality analysis.

#Data Aggregation ----

#For subsequent exploratory and DGE analysis, the generated tbl.df Aggregation_Set_Math will be used
#This tibble will also be written out to a .tsv for sharing offline/etc

Aggregation_List <- list(Gosselin_log2_cpm_filtered_norm, Eggen_log2_cpm_filtered_norm, KJS_iMG_MDM, 
                         DGEList_Abud_filtered_norm, DGEList_Blurton_filtered_norm, DGEList_Svoboda_filtered_norm)
Aggregation_Set <- Aggregation_List %>%
  purrr::reduce(left_join, by = "GeneID")
Aggregation_Set <- na.omit(Aggregation_Set) #remove NA probes
write_tsv(Aggregation_Set, "BMIN_503_Aggregated_Transcriptomes.tsv") #write to .tsv

#Move the gene names to rownames to eliminate the non numeric column, needed for PCA /
#Other statistical operations
Aggregation_Set_Math <- column_to_rownames(Aggregation_Set, var = "GeneID")  
dim(Aggregation_Set_Math)
## [1] 10259   118
#Create Study Design File for Aggregate set to facilitate DGE Analysis
#As well as further sharing of this report / dataset with colleagues. 
#Will contain sample labels and factors corresponding to Study authors
#For each dataset, condition (eg ExVivo, InVitro, Transplanted, etc),
#And cell type (microglia or macrophage, latter included as a sort of negative control)
#Other categorical variables (EG HIV infectable? yes or no, etc) could be added to this study design file
#to check other relationships between the datasets. 

labels_Aggregate <- colnames(Aggregation_Set_Math)
studies_Aggregate <- c(rep("Gosselin_2017", 24), rep("Eggen_2017", 39), rep("Ryan_2020", 6), 
                       rep("Abud_2017", 15), rep("Blurton-Jones_2019", 16), rep("Svoboda", 18))

condition_Aggregate <- c(rep("ExVivo_Juvenile", 24), rep("ExVivo_Adult", 39), rep("InVitro_Monoculture", 6),
                         rep("InVitro_rHcN_Coculture", 6), rep("InVitro_Monoculture", 9),
                         rep("Xenografted", 10), rep("InVitro_Monoculture", 6), rep("Xenografted", 6),
                         rep("Xenografted_60d", 6), rep("InVitro_Monoculture", 6))

celltype_Aggregate <- c(rep("Microglia", 63), rep("Monocyte_Derived_Macrophages", 3),
                        rep("Microglia", 52))


Study_Design_Aggregate <- tibble(Sample_Labels = labels_Aggregate, Study = studies_Aggregate,
                                 Conditions = condition_Aggregate, Cell_Types = celltype_Aggregate)
Study_Design_Aggregate <- Study_Design_Aggregate %>%
  dplyr::mutate(Study = factor(Study, levels = c("Gosselin_2017", "Eggen_2017", "Ryan_2020",
                                                 "Abud_2017", "Blurton-Jones_2019", "Svoboda"))) %>%
  dplyr::mutate(Conditions = factor(Conditions, levels = c("ExVivo_Juvenile", "ExVivo_Adult", "InVitro_Monoculture",
                                                 "InVitro_rHcN_Coculture", "Xenografted", "Xenografted_60d"))) %>%
  dplyr::mutate(Cell_Types = factor(Cell_Types, levels = c("Microglia", "Monocyte_Derived_Macrophages")))

head(Study_Design_Aggregate)
## # A tibble: 6 × 4
##   Sample_Labels       Study         Conditions      Cell_Types
##   <chr>               <fct>         <fct>           <fct>     
## 1 J_Microglia_ExVivo1 Gosselin_2017 ExVivo_Juvenile Microglia 
## 2 J_Microglia_ExVivo2 Gosselin_2017 ExVivo_Juvenile Microglia 
## 3 J_Microglia_ExVivo3 Gosselin_2017 ExVivo_Juvenile Microglia 
## 4 J_Microglia_ExVivo4 Gosselin_2017 ExVivo_Juvenile Microglia 
## 5 J_Microglia_ExVivo5 Gosselin_2017 ExVivo_Juvenile Microglia 
## 6 J_Microglia_ExVivo6 Gosselin_2017 ExVivo_Juvenile Microglia

###Results

More detailed descriptions are commented within code blocks 10 - 12

#Exploratory Analysis (PCA)----

pca_res_aggregate <- prcomp(t(Aggregation_Set_Math), scale. = F, retx = T)
ls(pca_res_aggregate)
## [1] "center"   "rotation" "scale"    "sdev"     "x"
#The rotations indicate how much each individual gene contributes to each principal component
#Sometimes referred to in formulas as "loadings"
str(pca_res_aggregate$rotation)
##  num [1:10259, 1:118] 0.00254 0.00708 0.00372 0.00669 -0.0154 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10259] "A2M" "AAAS" "AACS" "AAGAB" ...
##   ..$ : chr [1:118] "PC1" "PC2" "PC3" "PC4" ...
head(pca_res_aggregate$rotation[1:5, 1:4])
##                PC1          PC2           PC3          PC4
## A2M    0.002535021  0.016500248 -0.0157537108 -0.004979660
## AAAS   0.007076908  0.001338595 -0.0158633633 -0.003918717
## AACS   0.003724796 -0.011160336  0.0047551736 -0.010758167
## AAGAB  0.006686283  0.001467097 -0.0001881314  0.002010675
## AAK1  -0.015402729 -0.002787139  0.0021502716  0.007428113
#Store proportion of variance from each principal component (118 total) 
#in a Vector of Eigenvalues
pc_var_aggregate <- pca_res_aggregate$sdev^2  

#Calculate percentages of variance (eg sdev^2) per principal component, for PCA plot labels
pc_per_aggregate <- round(pc_var_aggregate/sum(pc_var_aggregate)*100, 1) 

#Read the coordinates, the value of each PCA for each sample (all 118 samples), and store in a data frame.
#main data for typical PCA plots of individual samples
head(pca_res_aggregate$x[1:5, 1:10])
##                           PC1       PC2       PC3      PC4         PC5
## J_Microglia_ExVivo1 -25.02148 11.540237 -57.71525 39.16504  -7.1441225
## J_Microglia_ExVivo2 -21.04824 -3.057133 -47.11292 42.34468   0.4171218
## J_Microglia_ExVivo3 -41.77136 -3.584041 -46.61993 46.22119  -1.9162314
## J_Microglia_ExVivo4 -53.96035 30.603902 -65.13237 23.36892 -13.7107552
## J_Microglia_ExVivo5 -77.49303 14.319996 -45.53603 22.38649  -8.6809350
##                           PC6       PC7        PC8       PC9      PC10
## J_Microglia_ExVivo1  7.359722 10.751431  -7.682503  5.341852 -3.972164
## J_Microglia_ExVivo2 12.577755  8.093901 -15.497597 -7.023665  8.870289
## J_Microglia_ExVivo3  9.261728 13.072300  -8.874103 -4.054169  5.610533
## J_Microglia_ExVivo4  6.142062  4.019717  -3.386855 -4.493134 -8.995202
## J_Microglia_ExVivo5 13.785142  2.202891  13.631039 -5.285502  1.273049
pca_res_df_aggregate <- as_tibble(pca_res_aggregate$x)  

#Before creating the PCA plot of the samples, use fviz_eig to show the variation 
#due to each principal component. This is called a scree plot
#The form of this scree plot suggests tha tprincipal components 1, 2, and 3 are worth investigating / retaining

fviz_eig(pca_res_aggregate)

#Plot Principal Components 1 and 2
pca_plot_aggregate_1_2 <- ggplot(pca_res_df_aggregate, aes(x = PC2, y = PC1, 
                          color = condition_Aggregate, shape = studies_Aggregate)) +
                          geom_point(size = 3, alpha = .5) + 
                          xlab(paste0("PC2 (", pc_per_aggregate[2], "%", ")")) +
                          ylab(paste0("PC1 (", pc_per_aggregate[1], "%", ")")) +
                          labs(title = "PCA Plot") + 
                          stat_ellipse(aes(color =  condition_Aggregate), #Add Elipse to help differentiate groups
                          linewidth = 0.9, level = 0.95) 
ggplotly(pca_plot_aggregate_1_2)
#Plot Principal Components 1 and 3
pca_plot_aggregate_1_3 <- ggplot(pca_res_df_aggregate, aes(x = PC3, y = PC1, color = condition_Aggregate, shape = studies_Aggregate)) +
  geom_point(size = 3, alpha = .5) + 
  xlab(paste0("PC3 (", pc_per_aggregate[3], "%", ")")) +
  ylab(paste0("PC1 (", pc_per_aggregate[1], "%", ")")) +
  labs(title = "PCA Plot") + 
  stat_ellipse(aes(color =  condition_Aggregate), #Add Elipse to help differentiate groups
  linewidth = 0.9, level = 0.95) +
  theme_bw()
ggplotly(pca_plot_aggregate_1_3)
#Plot Principal Components 2 and 3
pca_plot_aggregate_2_3 <- ggplot(pca_res_df_aggregate, aes(x = PC3, y = PC2, color = condition_Aggregate, shape = studies_Aggregate)) +
  geom_point(size = 3, alpha = .5) + 
  xlab(paste0("PC3 (", pc_per_aggregate[3], "%", ")")) +
  ylab(paste0("PC2 (", pc_per_aggregate[2], "%", ")")) +
  labs(title = "PCA Plot") + 
  stat_ellipse(aes(color =  condition_Aggregate), #Add Elipse to help differentiate groups
               linewidth = 0.9, level = 0.95) +
  theme_bw()
ggplotly(pca_plot_aggregate_2_3)
#A goal of this analysis was to visualize and quantify the distance
#between different samples on the PCA plots. This ended up being somewhat challenging
#So the approach below is a simplified attempt to quantify the distance of 
#the various sample groups from the reference datasets by using simple centroids (eg arithmetic means of coordinates)
#of each group of samples based on their coordinates for principal components 1 and 2 and a simple
#novel distance function (called 'euclidean', introduced below)

pc1_pc2_ExVivo <- pca_res_aggregate$x[1:63, 1:2]
centroid_ExVivo <- c(mean(pc1_pc2_ExVivo[ , 1]), mean(pc1_pc2_ExVivo[ , 2]))

pc1_pc2_KJS_iMg <- pca_res_aggregate$x[67:69, 1:2]
centroid_KJS <- c(mean(pc1_pc2_KJS_iMg[ , 1]), mean(pc1_pc2_KJS_iMg[ , 2]))

pc1_pc2_Abud_iMg <- pca_res_aggregate$x[76:84, 1:2]
centroid_Abud <- c(mean(pc1_pc2_Abud_iMg[ , 1]), mean(pc1_pc2_Abud_iMg[ , 2]))

pc1_pc2_BlurtonJones_XMg <- pca_res_aggregate$x[85:94, 1:2]
centroid_BJ_XMg <- c(mean(pc1_pc2_BlurtonJones_XMg[ , 1]), mean(pc1_pc2_BlurtonJones_XMg[ , 2]))

pc1_pc2_BlurtonJones_iMg <- pca_res_aggregate$x[95:100, 1:2]
centroid_BJ_iMg <- c(mean(pc1_pc2_BlurtonJones_iMg[ , 1]), mean(pc1_pc2_BlurtonJones_iMg[ , 2]))

pc1_pc2_Svoboda_xMg <- pca_res_aggregate$x[107:112, 1:2]
centroid_Svoboda_xMg <- c(mean(pc1_pc2_Svoboda_xMg[ , 1]), mean(pc1_pc2_Svoboda_xMg[ , 2]))
  
pc1_pc2_Svoboda_iMg <- pca_res_aggregate$x[113:118, 1:2]
centroid_Svoboda_iMg <- c(mean(pc1_pc2_Svoboda_iMg[ , 1]), mean(pc1_pc2_Svoboda_iMg[ , 2]))

euclidean <- function(a, b) sqrt(sum((a - b)^2))

Dist_to_Ref <- data.frame(Samples = c("KJS iMG", "Abud_iMg", "BJ_XMg", "BJ_iMg", "Svoboda_XMg", "Svoboda_iMg"),
                          Distance = rep(NA, 6))
Dist_to_Ref$Distance[1] <- euclidean(centroid_ExVivo, centroid_KJS)
Dist_to_Ref$Distance[2] <- euclidean(centroid_ExVivo, centroid_Abud)
Dist_to_Ref$Distance[3] <- euclidean(centroid_ExVivo, centroid_BJ_XMg)
Dist_to_Ref$Distance[4] <- euclidean(centroid_ExVivo, centroid_BJ_iMg)
Dist_to_Ref$Distance[5] <- euclidean(centroid_ExVivo, centroid_Svoboda_xMg)
Dist_to_Ref$Distance[6] <- euclidean(centroid_ExVivo, centroid_Svoboda_iMg)

Dist_to_Ref
##       Samples Distance
## 1     KJS iMG 135.8215
## 2    Abud_iMg 133.8918
## 3      BJ_XMg 104.1277
## 4      BJ_iMg 161.9481
## 5 Svoboda_XMg 155.7900
## 6 Svoboda_iMg 199.6711

The differential gene expression alayisis below includes the following pairwise comparisons:

#Differential Gene Expression Analysis----

#Create a matrix of sample information and factor wrap to facilitate pairwise comparisons
matrix_DGE <- c(rep("J_Microglia_ExVivo", 24), rep("A_Microglia_ExVivo", 39), rep("Ryan_MDM_InVitro", 3), 
                rep("Ryan_iMG_InVitro", 3), rep("Abud_iMG_InVitro_rHcH1", 6), rep("Abud_iMG_InVitro_", 9),
                rep("BlurtonJones_iMg_Xenotransplant", 10), rep("BlurtonJones_iMG_InVitro", 6), 
                rep("Svoboda_iMg_Xenotransplant", 6), rep("Svoboda_iMg_Xenotransplant_60d", 6), rep("Svoboda_iMg_InVitro", 6))

matrix_DGE <- factor(matrix_DGE, levels = c("J_Microglia_ExVivo", "A_Microglia_ExVivo", "Ryan_MDM_InVitro", 
                                            "Ryan_iMG_InVitro", "Abud_iMG_InVitro_rHcH1", "Abud_iMG_InVitro_",
                                            "BlurtonJones_iMg_Xenotransplant", "BlurtonJones_iMG_InVitro", 
                                            "Svoboda_iMg_Xenotransplant", "Svoboda_iMg_Xenotransplant_60d", "Svoboda_iMg_InVitro"))
summary(matrix_DGE)
##              J_Microglia_ExVivo              A_Microglia_ExVivo 
##                              24                              39 
##                Ryan_MDM_InVitro                Ryan_iMG_InVitro 
##                               3                               3 
##          Abud_iMG_InVitro_rHcH1               Abud_iMG_InVitro_ 
##                               6                               9 
## BlurtonJones_iMg_Xenotransplant        BlurtonJones_iMG_InVitro 
##                              10                               6 
##      Svoboda_iMg_Xenotransplant  Svoboda_iMg_Xenotransplant_60d 
##                               6                               6 
##             Svoboda_iMg_InVitro 
##                               6
design_matrix <-  model.matrix(~0 + matrix_DGE)
colnames(design_matrix) <- levels(matrix_DGE)

#Use limma::voom() to model the mean:variance relationship in our dataset. Necessary 
#for providing weights in downstream linear modeling. We will then fit a linear model to 
#to the entire set of gene expression data, eg to each individual gene / probe using lmFit()

#The output linear model will then be used to evaluate contrasts for the appropriate pairwise comparisons,
#generate statistics for individual genes using eBayes(), and summarize the significant differentially expressed 
#genes using a top table. 

#Voom() Requires non-negative values, so will need to un log2 transform our aggregate dataset
#Aggregation_Set_Math_no_log still has filtered / normalized data but now the 
#Units are just transcripts per million (TPM) instead of log2(TPM) which is the typically reported metric

Aggregation_Set_Math_no_log <- 2^Aggregation_Set_Math
V_Aggregation_Set_Math <- voom(Aggregation_Set_Math_no_log, design_matrix, plot = TRUE)

#Fit a linear model to the now weighted dataset
fit <- lmFit(V_Aggregation_Set_Math, design_matrix)

#Extract the expression data back out from the EList created by voom
#and redo log transformation. Contains some NA, may have to remove later if necessary
V_Aggregation_Set_Math_tbl <- (as_tibble(log2(V_Aggregation_Set_Math$E), rownames = "GeneID"))


#Use our design matrix file to begin creating contrasts
#corresponding to our pairwise comparisons of interest
#Extracts the linear model created above, get's statistics on it
#using eBayes(), and then summarizes top differentially expressed genes / DEGs in a topTable
#Beginning with a comparison of the two reference data sets (Eggen and Gosselin)
contrast.matrix_Juv_Adlt <- makeContrasts(Juv_v_Adlt = A_Microglia_ExVivo - J_Microglia_ExVivo, levels = design_matrix)    
fits_Juv_Adlt <- contrasts.fit(fit, contrast.matrix_Juv_Adlt)
ebFit_Juv_Adlt <- eBayes(fits_Juv_Adlt)
top_hits_Juv_Adlt <- topTable(ebFit_Juv_Adlt, adjust = "BH", coef = 1, number = 1500, sort.by = "logFC")
top_hits_Juv_Adlt <- as_tibble(top_hits_Juv_Adlt, rownames = "GeneID")
gene_names_top_1 <- top_hits_Juv_Adlt$GeneID

#Now Use top table to create an enhanced volcano plot and visualize differentially expressed genes
EnhancedVolcano(top_hits_Juv_Adlt, lab = gene_names_top_1, x = 'logFC',
                y = 'adj.P.Val', title = "Adult vs Pediatric ExVivo MG",
                pCutoff = 10e-6,
                FCcutoff = 2.0,
                pointSize = 3.0,
                labSize = 4.0,
                shape = c(1, 4, 23, 25),
                colAlpha = 1)

#now determine numbers of significant differentially expressed genes, summarize results
#And store all differentially expressed genes for this comparison (in either direction)
#In a data frame, and write those results to a tsv for future sharing or GSEA
results_Juv_Adlt <- decideTests(ebFit_Juv_Adlt, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_Juv_Adlt)
## TestResults matrix
##        Contrasts
##         Juv_v_Adlt
##   A2M            0
##   AAAS           0
##   AACS           0
##   AAGAB          0
##   AAK1           0
##   AAMDC          0
summary(results_Juv_Adlt)
##        Juv_v_Adlt
## Down           68
## NotSig       9937
## Up            254
ddx_genes_Eggen_Adult<- V_Aggregation_Set_Math$E[results_Juv_Adlt[ , 1] != 0, ]
ddx_genes_Eggen_Adult <- as_tibble(ddx_genes_Eggen_Adult, rownames = "GeneID")
write_tsv(ddx_genes_Eggen_Adult, "ddx_Genes_Eggen_v_Gosselin.tsv") #write to .tsv


#Moving on to comparing Abud vs Ryan InVitro IPSC-Microglia
contrast.matrix_Abud_Ryan <- makeContrasts(Abud_v_Ryan = Ryan_iMG_InVitro - Abud_iMG_InVitro_, levels = design_matrix)    
fits_Abud_Ryan <- contrasts.fit(fit, contrast.matrix_Abud_Ryan)
ebFit_Abud_Ryan <- eBayes(fits_Abud_Ryan)
top_hits_Abud_Ryan <- topTable(ebFit_Abud_Ryan, adjust = "BH", coef = 1, number = 1500, sort.by = "logFC")
top_hits_Abud_Ryan <- as_tibble(top_hits_Abud_Ryan, rownames = "GeneID")
gene_names_top_2 <- top_hits_Abud_Ryan$GeneID

EnhancedVolcano(top_hits_Abud_Ryan, lab = gene_names_top_2, x = 'logFC',
                y = 'adj.P.Val', title = "Abud vs Ryan InVitro MG",
                pCutoff = 10e-6,
                FCcutoff = 2.0,
                pointSize = 3.0,
                labSize = 4.0,
                shape = c(1, 4, 23, 25),
                colAlpha = 1)

results_Abud_Ryan <- decideTests(ebFit_Abud_Ryan, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_Abud_Ryan)
## TestResults matrix
##        Contrasts
##         Abud_v_Ryan
##   A2M             0
##   AAAS            0
##   AACS            0
##   AAGAB           0
##   AAK1            0
##   AAMDC           0
summary(results_Abud_Ryan)
##        Abud_v_Ryan
## Down           195
## NotSig        9931
## Up             133
ddx_genes_Abud_Ryan <- V_Aggregation_Set_Math$E[results_Abud_Ryan[ , 1] != 0, ]
ddx_genes_Abud_Ryan <- as_tibble(ddx_genes_Abud_Ryan, rownames = "GeneID")
write_tsv(ddx_genes_Abud_Ryan, "ddx_Genes_Abud_v_Ryan.tsv") #write to .tsv

#Moving on to comparing Xenotransplanted Microglia from Blurton-Jones to the 60D 
#Xenotransplants from the Svoboda study
contrast.matrix_BJ_Svoboda <- makeContrasts(BJ_v_Svoboda = BlurtonJones_iMg_Xenotransplant - Svoboda_iMg_Xenotransplant_60d, levels = design_matrix)    
fits_BJ_Svoboda <- contrasts.fit(fit, contrast.matrix_BJ_Svoboda)
ebFit_BJ_Svoboda  <- eBayes(fits_BJ_Svoboda)
top_hits_BJ_Svoboda <- topTable(ebFit_BJ_Svoboda, adjust = "BH", coef = 1, number = 1500, sort.by = "logFC")
top_hits_BJ_Svoboda <- as_tibble(top_hits_BJ_Svoboda, rownames = "GeneID")
gene_names_top_3 <- top_hits_BJ_Svoboda$GeneID

EnhancedVolcano(top_hits_BJ_Svoboda, lab = gene_names_top_3, x = 'logFC',
                y = 'adj.P.Val', title = "Blurton-Jones vs Svoboda Xenotransplanted MG",
                pCutoff = 10e-6,
                FCcutoff = 2.0,
                pointSize = 3.0,
                labSize = 4.0,
                shape = c(1, 4, 23, 25),
                colAlpha = 1)

results_BJ_Svoboda <- decideTests(ebFit_BJ_Svoboda, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_BJ_Svoboda)
## TestResults matrix
##        Contrasts
##         BJ_v_Svoboda
##   A2M              0
##   AAAS             0
##   AACS             0
##   AAGAB            0
##   AAK1             0
##   AAMDC            0
summary(results_BJ_Svoboda)
##        BJ_v_Svoboda
## Down             32
## NotSig         9850
## Up              377
ddx_genes_BJ_Svoboda <- V_Aggregation_Set_Math$E[results_BJ_Svoboda[ , 1] != 0, ]
ddx_genes_BJ_Svoboda <- as_tibble(ddx_genes_BJ_Svoboda, rownames = "GeneID")
write_tsv(ddx_genes_BJ_Svoboda, "ddx_Genes_BJ_v_Svoboda.tsv") #write to .tsv

#Moving on to comparing Xenotransplanted Microglia from Svoboda study to 
#InVitro IPSC-Microglia from the Ryan Study
contrast.matrix_Svoboda_Ryan <- makeContrasts(Svoboda_v_Ryan = Ryan_iMG_InVitro - Svoboda_iMg_Xenotransplant_60d, levels = design_matrix)    
fits_Svoboda_Ryan <- contrasts.fit(fit, contrast.matrix_Svoboda_Ryan)
ebFit_Svoboda_Ryan  <- eBayes(fits_Svoboda_Ryan)
top_hits_Svoboda_Ryan <- topTable(ebFit_Svoboda_Ryan, adjust = "BH", coef = 1, number = 2500, sort.by = "logFC")
top_hits_Svoboda_Ryan <- as_tibble(top_hits_Svoboda_Ryan, rownames = "GeneID")
gene_names_top_4 <- top_hits_Svoboda_Ryan$GeneID

EnhancedVolcano(top_hits_Svoboda_Ryan, lab = gene_names_top_4, x = 'logFC',
                y = 'adj.P.Val', title = "Svoboda Xenotransplanted MG vs Ryan InVitro IPSC-MG",
                pCutoff = 10e-6,
                FCcutoff = 2.0,
                pointSize = 1.5,
                labSize = 4.0,
                shape = c(1, 4, 23, 25),
                colAlpha = 0.5)

results_Svoboda_Ryan <- decideTests(ebFit_Svoboda_Ryan, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_Svoboda_Ryan)
## TestResults matrix
##        Contrasts
##         Svoboda_v_Ryan
##   A2M               -1
##   AAAS               0
##   AACS               0
##   AAGAB              0
##   AAK1               0
##   AAMDC              0
summary(results_Svoboda_Ryan)
##        Svoboda_v_Ryan
## Down              349
## NotSig           8803
## Up               1107
ddx_genes_Svoboda_Ryan <- V_Aggregation_Set_Math$E[results_Svoboda_Ryan[ , 1] != 0, ]
ddx_genes_Svoboda_Ryan <- as_tibble(ddx_genes_Svoboda_Ryan, rownames = "GeneID")
write_tsv(ddx_genes_Svoboda_Ryan, "ddx_Genes_Svoboda_v_Ryan.tsv") #write to .tsv

Below I will perform a simplified gene set expression analysis (GSEA) using only the pairwise comparison between the Ryan and Svoboda datasets, as in the volcano plot above. I selected this comparison because these datasets had the most differentially expressed genes out of any of our sets of comparisons. This non-exhaustive analysis is included just to show the process of generating GSEA data for this sort of input.

#GSEA Analysis using Pairwise comparisons from DGE Analysis ----

#read in annotations for popular gene sets
c2cp <- read.gmt("c2.all.v7.1.symbols.gmt")
c5cp <- read.gmt("c5.all.v7.1.symbols.gmt")
cHcp <- read.gmt("c7.all.v7.1.symbols.gmt")

#arrange the top hits from this pairwise comparison in order by logFC
#Eg genes most highly upregulated by Ryan samples compared to 
#Svoboda samples are sorted to the top

top_hits_Svoboda_Ryan <- top_hits_Svoboda_Ryan %>%
  arrange(desc(logFC))

#get a list of the gene names in order of highest logFC for use in GSEA
Gene_list_Svoboda_Ryan <- top_hits_Svoboda_Ryan$logFC
names(Gene_list_Svoboda_Ryan) <- top_hits_Svoboda_Ryan$GeneID

#use our ranked list to perform GSEA, store our results in a tibble
Ryan_iMG_v_Svoboda_xMg_GSEA_res <- GSEA(Gene_list_Svoboda_Ryan, TERM2GENE = c2cp, verbose = FALSE)
Ryan_iMG_v_Svoboda_xMg_GSEA_res_df <- as_tibble(Ryan_iMG_v_Svoboda_xMg_GSEA_res)
dim(Ryan_iMG_v_Svoboda_xMg_GSEA_res_df)
## [1] 46 11
head(Ryan_iMG_v_Svoboda_xMg_GSEA_res_df)
## # A tibble: 6 × 11
##   ID         Descr…¹ setSize enric…²   NES  pvalue p.adj…³  qvalue  rank leadi…⁴
##   <chr>      <chr>     <int>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>  
## 1 GINESTIER… GINEST…      92   0.450  2.74 1.97e-9 4.23e-6 3.70e-6   780 tags=6…
## 2 VECCHI_GA… VECCHI…      35  -0.492 -2.94 6.01e-7 6.08e-4 5.32e-4   397 tags=5…
## 3 SANA_RESP… SANA_R…      23  -0.570 -2.90 1.42e-6 6.08e-4 5.32e-4   670 tags=7…
## 4 THUM_SYST… THUM_S…      82  -0.336 -2.66 8.58e-7 6.08e-4 5.32e-4   305 tags=3…
## 5 NABA_MATR… NABA_M…      96   0.392  2.40 1.20e-6 6.08e-4 5.32e-4   516 tags=4…
## 6 NABA_MATR… NABA_M…      80   0.393  2.28 7.47e-6 2.67e-3 2.34e-3   516 tags=4…
## # … with 1 more variable: core_enrichment <chr>, and abbreviated variable names
## #   ¹​Description, ²​enrichmentScore, ³​p.adjust, ⁴​leading_edge
#generate an interactive datatable to more easily visualize the results

datatable(Ryan_iMG_v_Svoboda_xMg_GSEA_res_df[ , c(1, 3, 4:7)], 
          extensions = c("KeyTable", "FixedHeader"),
          caption = "Signatures enriched in Ryan iMg (NES > 0) or Svoboda XMg (NES < 0)",
          options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50"))) %>%
  formatRound(columns=c(2:7), digits=8)
#Plot some of the most highly enriched (either in Ryan or Svoboda) pathways by Normalized Enrichment score,
#plot also automatically includes aesthetics for count (number of genes in our data that are found in a given pathway),
#and adjusted p value

dotplot(Ryan_iMG_v_Svoboda_xMg_GSEA_res, x = "NES", showCategory = 15) + ggtitle("Dot Plot for Ryan iMg vs Svoboda xMg GSEA")

#plot key enrichments

#Several cancer pathways are enriched in the Ryan iMicroglia
#For cultured cells, this can suggest that changes associated with 
#an artificially long life on a plate are occuring. This might suggest
#that these microglia are moving further away from naturel behavior and 
#are becoming quasi-immortalized like cancer cells 
gseaplot2(Ryan_iMG_v_Svoboda_xMg_GSEA_res,
          geneSetID = 22,
          pvalue_table = FALSE,
          title = Ryan_iMG_v_Svoboda_xMg_GSEA_res$Description[22])

#Interferon gamma response is up in the xenografted microglia
#Other interferon related pathways are similarly enriched in the Svoboda
#microglia. These signatures are typically associated with anti-viral activity
#and could inticate that these microglia were behaving as expected in the xenograft 
#environment
gseaplot2(Ryan_iMG_v_Svoboda_xMg_GSEA_res,
          geneSetID = 5,
          pvalue_table = FALSE,
          title = Ryan_iMG_v_Svoboda_xMg_GSEA_res$Description[5])

### Conclusions

Xenotransplanted microglia from Svoboda and Blurton-Jones studies outperform both in-study in vitro controls and and in vitro IPSC-microglia from Abud and Ryan studies. Of the two xenotransplanted datasets, the Blurton-Jones microglia are closest to the reference dataset based on the (rudimentary) quantification of principal component distances (eg at the end of code block 10)

Xenotransplanted samples were not drastically closer to the reference ex vivo samples based on the PCA plots. Considering that these protocols are much more expensive and technically challenging / intensive than in vitro approaches, labs should carefully consider which protocol makes the most sense for them based on their specific questions and unique logistics/capacity.

The exclusion of unrelated / non—microglial cell types in aggregate PCA allows us to more realistically interpret relationships between the different microglial preparations. However, the inclusion of other myeloid cell types or partially differentiated IPSCs may be additionally informative and may be worth working into the analysis.

Differential gene expression analysis showed that datasets that grouped farther apart on the exploratory PCA showed higher numbers of differentially expressed genes, as expected. More in depth GSEA analysis, as began in block 12, would be necessary to draw conclusions about gene signatures peculiar to individual protocols.

This limited analysis has many limitations, the most critical is the loss of data that occured when aggregating the six constituent datasets, over 4000 observations. Future applications of this code would need to either more effectively join the datasets or otherwise impute this missing date. Additionally, only a small number of studies and published protocols are included in this study; a more extensive meta-analysis would need to incorporate many more studies and of course have more rigid inclusion / exclusion criteria. Finally, the datasets combined in this study may have been generated using different protocols / instruments, which is another source of variation in the final dataset.